Segmenting Chinese Unknown Words by Heuristic Method
Abstract
Chinese text segmentation is important for Chinese text indexing. Because Chinese text lacks word delimiters, it is more difficult to segment than English text. Segmentation ambiguities and out-of-vocabulary words (i.e., unknown words) are the major challenges in Chinese segmentation. Much research on word segmentation has focused on resolving segmentation ambiguities, while the problem of unknown word identification has drawn little attention. In this paper, we propose a heuristic method for Chinese text segmentation based on a statistical approach. The experimental results show that the proposed heuristic method is promising for segmenting unknown words as well as known words. We further investigate the distribution of the errors of commission and the errors of omission caused by the proposed heuristic method, and benchmark it against our previously proposed technique, boundary detection.
Similar resources
A heuristic method based on a statistical approach for Chinese text segmentation
The authors propose a heuristic method for Chinese automatic text segmentation based on a statistical approach. This method is developed based on statistical information about the association among adjacent characters in Chinese text. Mutual information of bi-grams and significant estimation of tri-grams are utilized. A heuristic method with six rules is then proposed to determine the segmentat...
Chinese Unknown Word Identification Using Character-based Tagging and Chunking
Since written Chinese has no spaces to delimit words, segmenting Chinese texts becomes an essential task. During this task, the problem of unknown words arises. It is impossible to register all words in a dictionary, as new words can always be created by combining characters. We propose a unified solution to detect unknown words in Chinese texts. First, a morphological analysis is done to obtain i...
A Heuristic Method for Chinese Segmentation
Research and development in digital library includes content creation, conversion, indexing, organization, and dissemination, where the key technological issues are how to search and display desired selections from and across large collections effectively [10]. A repository is an indexed collection of objects. Indexing is an important task for searching. The better the indexing, the better the ...
Design of CKIP Chinese Word Segmentation System
In this paper, we describe the design of the CKIP Chinese word segmentation system and analyse its performance. The system utilizes a modulized approach. Independent modules were designed to solve the problems of segmentation ambiguities and identifying unknown words. Segmentation ambiguities are resolved by a hybrid method of using heuristic and statistical rules. Regular-type unknown words ar...
Unknown Word Identification for Chinese Morphological Analysis
Since written Chinese does not use blank spaces to indicate word boundaries, segmenting Chinese texts becomes an essential task for Chinese language processing. Besides word segmentation, we also need to identify the part-of-speech (POS) tags of the words. The segmentation and POS tagging process are denoted as morphological analysis. During the process of word segmentation, two main problems o...